Revisiting the coffee dataset
Revisiting the dataset to see if total cupping score can be predicted from the variables in the dataset.
What I know so far:
Total Cupping Score is the summation of scores from the 10 attributes: aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean_cup, sweetness and cupper_points.
Quakers refer to immature/unripe beans. This is a defect variable, same as category_one_defect and category_two_defect. The more the defects, the lower the total cupping score.
To carry out data cleaning and EDA for this dataset
To practice using tidymodels framework to predict total cupping score
Most of the exercises I encountered online use the 10 attributes as part of prediction for the total cupping score, but I decided to leave them out due to the way it is calculated. I wanted to look at whether other variables, such as country of origin, processing methods, moisture etc can predict the total cupping score.
However, towards the end of the exercises, when I was evaluating the model, I realised that my model is really quite sub-optimal. The r-sq was ridiculously low, and the rmse is about 3.
What could the reason be? Perhaps the choice of variables was not suitable, perhaps I did not engineer my features well enough? I think I already chose the easiest algorithm, as the dataset was very skewed and had a lot of categories, so I decided to use random forest, which doesn’t require a lot of feature engineering…
Perhaps these variables are not enough to predict the cupping score? Quality isn’t just a measure of country, region, processing methods, although they can give some indications as to whether the coffee beans are good… I’m probably missing some additional information…
Other exercises that used the 10 attributes had very good R-sq values, and understandably so, because total cupping points are calculated from these attributes by summing them up…
Nevertheless, I shall take it as an exercise using the tidymodels framework, and let me mull over how my ML can be improved…
coffee_ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')
Extracting column names
# extract col names-----
coffee_col_names <- colnames(coffee_ratings) %>%
as_tibble(.)
coffee_col_names %>%
print(n = nrow(.))
# A tibble: 43 × 1
value
<chr>
1 total_cup_points
2 species
3 owner
4 country_of_origin
5 farm_name
6 lot_number
7 mill
8 ico_number
9 company
10 altitude
11 region
12 producer
13 number_of_bags
14 bag_weight
15 in_country_partner
16 harvest_year
17 grading_date
18 owner_1
19 variety
20 processing_method
21 aroma
22 flavor
23 aftertaste
24 acidity
25 body
26 balance
27 uniformity
28 clean_cup
29 sweetness
30 cupper_points
31 moisture
32 category_one_defects
33 quakers
34 color
35 category_two_defects
36 expiration
37 certification_body
38 certification_address
39 certification_contact
40 unit_of_measurement
41 altitude_low_meters
42 altitude_high_meters
43 altitude_mean_meters
Initial look at the data
skim(coffee_ratings)
Name | coffee_ratings |
Number of rows | 1339 |
Number of columns | 43 |
_______________________ | |
Column type frequency: | |
character | 24 |
numeric | 19 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
species | 0 | 1.00 | 7 | 7 | 0 | 2 | 0 |
owner | 7 | 0.99 | 3 | 50 | 0 | 315 | 0 |
country_of_origin | 1 | 1.00 | 4 | 28 | 0 | 36 | 0 |
farm_name | 359 | 0.73 | 1 | 73 | 0 | 571 | 0 |
lot_number | 1063 | 0.21 | 1 | 71 | 0 | 227 | 0 |
mill | 315 | 0.76 | 1 | 77 | 0 | 460 | 0 |
ico_number | 151 | 0.89 | 1 | 40 | 0 | 847 | 0 |
company | 209 | 0.84 | 3 | 73 | 0 | 281 | 0 |
altitude | 226 | 0.83 | 1 | 41 | 0 | 396 | 0 |
region | 59 | 0.96 | 2 | 76 | 0 | 356 | 0 |
producer | 231 | 0.83 | 1 | 100 | 0 | 691 | 0 |
bag_weight | 0 | 1.00 | 1 | 8 | 0 | 56 | 0 |
in_country_partner | 0 | 1.00 | 7 | 85 | 0 | 27 | 0 |
harvest_year | 47 | 0.96 | 3 | 24 | 0 | 46 | 0 |
grading_date | 0 | 1.00 | 13 | 20 | 0 | 567 | 0 |
owner_1 | 7 | 0.99 | 3 | 50 | 0 | 319 | 0 |
variety | 226 | 0.83 | 4 | 21 | 0 | 29 | 0 |
processing_method | 170 | 0.87 | 5 | 25 | 0 | 5 | 0 |
color | 218 | 0.84 | 4 | 12 | 0 | 4 | 0 |
expiration | 0 | 1.00 | 13 | 20 | 0 | 566 | 0 |
certification_body | 0 | 1.00 | 7 | 85 | 0 | 26 | 0 |
certification_address | 0 | 1.00 | 40 | 40 | 0 | 32 | 0 |
certification_contact | 0 | 1.00 | 40 | 40 | 0 | 29 | 0 |
unit_of_measurement | 0 | 1.00 | 1 | 2 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
total_cup_points | 0 | 1.00 | 82.09 | 3.50 | 0 | 81.08 | 82.50 | 83.67 | 90.58 | ▁▁▁▁▇ |
number_of_bags | 0 | 1.00 | 154.18 | 129.99 | 0 | 14.00 | 175.00 | 275.00 | 1062.00 | ▇▇▁▁▁ |
aroma | 0 | 1.00 | 7.57 | 0.38 | 0 | 7.42 | 7.58 | 7.75 | 8.75 | ▁▁▁▁▇ |
flavor | 0 | 1.00 | 7.52 | 0.40 | 0 | 7.33 | 7.58 | 7.75 | 8.83 | ▁▁▁▁▇ |
aftertaste | 0 | 1.00 | 7.40 | 0.40 | 0 | 7.25 | 7.42 | 7.58 | 8.67 | ▁▁▁▁▇ |
acidity | 0 | 1.00 | 7.54 | 0.38 | 0 | 7.33 | 7.58 | 7.75 | 8.75 | ▁▁▁▁▇ |
body | 0 | 1.00 | 7.52 | 0.37 | 0 | 7.33 | 7.50 | 7.67 | 8.58 | ▁▁▁▁▇ |
balance | 0 | 1.00 | 7.52 | 0.41 | 0 | 7.33 | 7.50 | 7.75 | 8.75 | ▁▁▁▁▇ |
uniformity | 0 | 1.00 | 9.83 | 0.55 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
clean_cup | 0 | 1.00 | 9.84 | 0.76 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
sweetness | 0 | 1.00 | 9.86 | 0.62 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
cupper_points | 0 | 1.00 | 7.50 | 0.47 | 0 | 7.25 | 7.50 | 7.75 | 10.00 | ▁▁▁▇▁ |
moisture | 0 | 1.00 | 0.09 | 0.05 | 0 | 0.09 | 0.11 | 0.12 | 0.28 | ▃▇▅▁▁ |
category_one_defects | 0 | 1.00 | 0.48 | 2.55 | 0 | 0.00 | 0.00 | 0.00 | 63.00 | ▇▁▁▁▁ |
quakers | 1 | 1.00 | 0.17 | 0.83 | 0 | 0.00 | 0.00 | 0.00 | 11.00 | ▇▁▁▁▁ |
category_two_defects | 0 | 1.00 | 3.56 | 5.31 | 0 | 0.00 | 2.00 | 4.00 | 55.00 | ▇▁▁▁▁ |
altitude_low_meters | 230 | 0.83 | 1750.71 | 8669.44 | 1 | 1100.00 | 1310.64 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
altitude_high_meters | 230 | 0.83 | 1799.35 | 8668.81 | 1 | 1100.00 | 1350.00 | 1650.00 | 190164.00 | ▇▁▁▁▁ |
altitude_mean_meters | 230 | 0.83 | 1775.03 | 8668.63 | 1 | 1100.00 | 1310.64 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
# count -----
coffee_ratings %>%
count(species) # mostly arabica
# A tibble: 2 × 2
species n
<chr> <int>
1 Arabica 1311
2 Robusta 28
coffee_ratings %>%
count(owner) %>%
arrange(desc(n)) # 316 unique owners
# A tibble: 316 × 2
owner n
<chr> <int>
1 juan luis alvarado romero 155
2 racafe & cia s.c.a 60
3 exportadora de cafe condor s.a 54
4 kona pacific farmers cooperative 52
5 ipanema coffees 50
6 cqi taiwan icp cqi台灣合作夥伴 47
7 lin, che-hao krude 林哲豪 30
8 nucoffee 29
9 carcafe ltda ci 27
10 the coffee source inc. 23
# … with 306 more rows
# create a function to count and sort
count_and_sort <- function(col){
coffee_ratings %>%
count({{col}}) %>%
arrange(desc(n))
}
# 37 unique countries
count_and_sort(country_of_origin)
# A tibble: 37 × 2
country_of_origin n
<chr> <int>
1 Mexico 236
2 Colombia 183
3 Guatemala 181
4 Brazil 132
5 Taiwan 75
6 United States (Hawaii) 73
7 Honduras 53
8 Costa Rica 51
9 Ethiopia 44
10 Tanzania, United Republic Of 40
# … with 27 more rows
count_and_sort(farm_name) # many missing values for farm name, 572 unique farms
# A tibble: 572 × 2
farm_name n
<chr> <int>
1 <NA> 359
2 various 47
3 rio verde 23
4 several 20
5 finca medina 15
6 doi tung development project 13
7 fazenda capoeirnha 13
8 conquista / morito 11
9 los hicaques 11
10 capoeirinha 10
# … with 562 more rows
count_and_sort(lot_number)
# A tibble: 228 × 2
lot_number n
<chr> <int>
1 <NA> 1063
2 1 18
3 020/17 6
4 019/17 5
5 102 3
6 103 3
7 2 3
8 2016 Tainan Coffee Cupping Event Micro Lot 臺南市咖啡評鑑批次 3
9 017-053-0046 2
10 1-198 2
# … with 218 more rows
count_and_sort(owner)
# A tibble: 316 × 2
owner n
<chr> <int>
1 juan luis alvarado romero 155
2 racafe & cia s.c.a 60
3 exportadora de cafe condor s.a 54
4 kona pacific farmers cooperative 52
5 ipanema coffees 50
6 cqi taiwan icp cqi台灣合作夥伴 47
7 lin, che-hao krude 林哲豪 30
8 nucoffee 29
9 carcafe ltda ci 27
10 the coffee source inc. 23
# … with 306 more rows
count_and_sort(farm_name)
# A tibble: 572 × 2
farm_name n
<chr> <int>
1 <NA> 359
2 various 47
3 rio verde 23
4 several 20
5 finca medina 15
6 doi tung development project 13
7 fazenda capoeirnha 13
8 conquista / morito 11
9 los hicaques 11
10 capoeirinha 10
# … with 562 more rows
count_and_sort(species)
# A tibble: 2 × 2
species n
<chr> <int>
1 Arabica 1311
2 Robusta 28
count_and_sort(mill)
# A tibble: 461 × 2
mill n
<chr> <int>
1 <NA> 315
2 beneficio ixchel 90
3 dry mill 39
4 trilladora boananza 38
5 ipanema coffees 16
6 neiva 15
7 bachue 14
8 beneficio siembras vision (154) 12
9 cadexsa 12
10 cigrah s.a de c.v. 12
# … with 451 more rows
count_and_sort(ico_number)
# A tibble: 848 × 2
ico_number n
<chr> <int>
1 <NA> 151
2 0 77
3 Taiwan 31
4 2222 11
5 - 9
6 002/1660/0105 7
7 002/4177/0150 7
8 Taiwan台灣 7
9 002/1660/0080 6
10 1 6
# … with 838 more rows
count_and_sort(company)
# A tibble: 282 × 2
company n
<chr> <int>
1 <NA> 209
2 unex guatemala, s.a. 86
3 ipanema coffees 50
4 exportadora de cafe condor s.a 40
5 kona pacific farmers cooperative 40
6 racafe & cia s.c.a 40
7 blossom valley宸嶧國際 25
8 carcafe ltda 25
9 nucoffee 24
10 taiwan coffee laboratory 20
# … with 272 more rows
count_and_sort(altitude)
# A tibble: 397 × 2
altitude n
<chr> <int>
1 <NA> 226
2 1100 43
3 1200 42
4 1300 32
5 1400 32
6 4300 31
7 1250 30
8 1500 30
9 1700 28
10 1550 24
# … with 387 more rows
count_and_sort(region)
# A tibble: 357 × 2
region n
<chr> <int>
1 huila 112
2 oriente 80
3 south of minas 68
4 kona 66
5 <NA> 59
6 veracruz 35
7 tarrazu 19
8 comayagua 17
9 huehuetenango 16
10 san marcos 16
# … with 347 more rows
count_and_sort(producer)
# A tibble: 692 × 2
producer n
<chr> <int>
1 <NA> 231
2 La Plata 30
3 Ipanema Agrícola SA 22
4 Doi Tung Development Project 17
5 Ipanema Agricola 12
6 VARIOS 12
7 Ipanema Agricola S.A 11
8 ROBERTO MONTERROSO 10
9 AMILCAR LAPOLA 9
10 LA PLATA 9
# … with 682 more rows
count_and_sort(bag_weight)
# A tibble: 56 × 2
bag_weight n
<chr> <int>
1 1 kg 331
2 60 kg 256
3 69 kg 200
4 70 kg 156
5 2 kg 122
6 100 lbs 59
7 30 kg 29
8 5 lbs 23
9 6 19
10 20 kg 14
# … with 46 more rows
count_and_sort(in_country_partner)
# A tibble: 27 × 2
in_country_partner n
<chr> <int>
1 Specialty Coffee Association 313
2 AMECAFE 205
3 Almacafé 178
4 Asociacion Nacional Del Café 155
5 Brazil Specialty Coffee Association 67
6 Instituto Hondureño del Café 60
7 Blossom Valley International 58
8 Africa Fine Coffee Association 49
9 Specialty Coffee Association of Costa Rica 42
10 NUCOFFEE 36
# … with 17 more rows
count_and_sort(harvest_year)
# A tibble: 47 × 2
harvest_year n
<chr> <int>
1 2012 354
2 2014 233
3 2013 181
4 2015 129
5 2016 124
6 2017 70
7 <NA> 47
8 2013/2014 29
9 2015/2016 28
10 2011 26
# … with 37 more rows
count_and_sort(grading_date)
# A tibble: 567 × 2
grading_date n
<chr> <int>
1 July 11th, 2012 25
2 December 26th, 2013 24
3 June 6th, 2012 19
4 August 30th, 2012 18
5 July 26th, 2012 15
6 March 29th, 2013 13
7 October 8th, 2015 13
8 September 27th, 2012 13
9 June 17th, 2010 12
10 October 20th, 2017 11
# … with 557 more rows
count_and_sort(owner_1)
# A tibble: 320 × 2
owner_1 n
<chr> <int>
1 Juan Luis Alvarado Romero 155
2 Racafe & Cia S.C.A 60
3 Exportadora de Cafe Condor S.A 54
4 Kona Pacific Farmers Cooperative 52
5 Ipanema Coffees 50
6 CQI Taiwan ICP CQI台灣合作夥伴 46
7 Lin, Che-Hao Krude 林哲豪 29
8 NUCOFFEE 29
9 CARCAFE LTDA CI 27
10 The Coffee Source Inc. 23
# … with 310 more rows
count_and_sort(variety)
# A tibble: 30 × 2
variety n
<chr> <int>
1 Caturra 256
2 Bourbon 226
3 <NA> 226
4 Typica 211
5 Other 110
6 Catuai 74
7 Hawaiian Kona 44
8 Yellow Bourbon 35
9 Mundo Novo 33
10 Catimor 20
# … with 20 more rows
count_and_sort(processing_method)
# A tibble: 6 × 2
processing_method n
<chr> <int>
1 Washed / Wet 815
2 Natural / Dry 258
3 <NA> 170
4 Semi-washed / Semi-pulped 56
5 Other 26
6 Pulped natural / honey 14
count_and_sort(color)
# A tibble: 5 × 2
color n
<chr> <int>
1 Green 870
2 <NA> 218
3 Bluish-Green 114
4 Blue-Green 85
5 None 52
count_and_sort(expiration)
# A tibble: 566 × 2
expiration n
<chr> <int>
1 December 26th, 2014 25
2 July 11th, 2013 25
3 June 6th, 2013 19
4 August 30th, 2013 18
5 July 26th, 2013 15
6 March 29th, 2014 13
7 October 7th, 2016 13
8 September 27th, 2013 13
9 June 17th, 2011 12
10 October 20th, 2018 11
# … with 556 more rows
count_and_sort(certification_body) %>% datatable()
count_and_sort(unit_of_measurement)
# A tibble: 2 × 2
unit_of_measurement n
<chr> <int>
1 m 1157
2 ft 182
Robusta vs Arabica
# robusta vs arabica -----
coffee_ratings %>%
ggplot(aes(total_cup_points, fill = species)) +
geom_density(alpha = 0.4)
Change to correct columns and remove redundant columns
Filter to just include Arabica, since Robusta is traditionally a more inferior grade
coffee_a <-
coffee_ratings %>%
dplyr::filter(species == "Arabica") %>%
mutate(across(where(is.character), as.factor)) %>%
mutate_at(c("altitude"), as.numeric) %>%
mutate_at(c("harvest_year"), as.character) %>%
mutate(grading_date = lubridate::mdy(grading_date)) %>%
mutate(expiration = lubridate::mdy(expiration)) %>%
# collapse less than 10 into 1 category
mutate(certification_body = fct_lump_min(certification_body,10)) %>%
select(-lot_number,
-ico_number,
-bag_weight,
-farm_name,
-mill,
-producer,
-owner_1,
-certification_address,
-certification_contact
)
Next, to clean up the harvest years.
harvest_years <-
coffee_a %>%
select(harvest_year,
# grading_date
) %>%
unique() %>%
print(nrow = nrow(.))
# A tibble: 47 × 1
harvest_year
<chr>
1 2014
2 <NA>
3 2013
4 2012
5 March 2010
6 Sept 2009 - April 2010
7 May-August
8 2009/2010
9 2015
10 2011
# … with 37 more rows
harvest_years %>% datatable()
coffee_b <-
coffee_a %>%
filter(!harvest_year %in% c("May-August",
"mmm",
"TEST",
"January Through April",
"August to December",
"Mayo a Julio",
"Abril - Julio")) %>%
mutate(harvest_year = str_to_lower(harvest_year),
harvest_year = str_replace_all(harvest_year, "[[:punct:]]", " "),
#harvest_year_num = str_extract(harvest_year, "\\d{4}"))
harvest_year = str_replace_all(harvest_year, "4t 10", "2010"),
harvest_year = str_replace_all(harvest_year, "4t 2011", "2011"),
harvest_year = str_replace_all(harvest_year, "4t 2010", "2010"),
harvest_year = str_replace_all(harvest_year, "47 2010", "2010"),
harvest_year = str_replace_all(harvest_year, "23 july 2010", "2010"),
harvest_year = str_replace_all(harvest_year, "3t 2011", "2011"),
harvest_year = str_replace_all(harvest_year, "4t72010", "2010"),
harvest_year = str_replace_all(harvest_year, "08 09 crop", "2009"),
harvest_year = str_replace_all(harvest_year, "1t 2011", "2011"),
harvest_year_num = as.factor(parse_number(harvest_year)))
# check
coffee_b %>%
select(harvest_year, harvest_year_num) %>%
unique() %>%
datatable()
ggplot(coffee_b,
aes(x = harvest_year_num,
y = total_cup_points)) +
geom_boxplot() +
ggtitle("Harvest years")
# leave it as it is
coffee_b %>%
# filter(is.na(harvest_year_num)) %>%
select(harvest_year, grading_date)
# A tibble: 1,301 × 2
harvest_year grading_date
<chr> <date>
1 2014 2015-04-04
2 2014 2015-04-04
3 <NA> 2010-05-31
4 2014 2015-03-26
5 2014 2015-04-04
6 2013 2013-09-03
7 2012 2012-09-17
8 march 2010 2010-09-02
9 march 2010 2010-09-02
10 2014 2015-03-30
# … with 1,291 more rows
There are some obvious outliers for altitude, which should be removed.
altitude <- coffee_b %>%
select(starts_with("altitude"), unit_of_measurement )
altitude %>% datatable()
coffee_c <-
coffee_b %>%
select(-altitude_low_meters,
-altitude_high_meters, -harvest_year,
-altitude, -unit_of_measurement) %>%
# remove obvious outliers
filter(altitude_mean_meters < 5000 & altitude_mean_meters > 50)
coffee_c %>%
ggplot(aes(altitude_mean_meters)) +
geom_boxplot()
skim(coffee_c)
Name | coffee_c |
Number of rows | 1054 |
Number of columns | 30 |
_______________________ | |
Column type frequency: | |
Date | 2 |
factor | 11 |
numeric | 17 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
grading_date | 0 | 1 | 2010-04-09 | 2018-01-19 | 2014-04-30 | 456 |
expiration | 0 | 1 | 2011-04-09 | 2019-01-19 | 2015-04-30 | 456 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 1 | Ara: 1054 |
owner | 7 | 0.99 | FALSE | 277 | jua: 133, exp: 53, cqi: 41, ipa: 39 |
country_of_origin | 0 | 1.00 | FALSE | 35 | Mex: 230, Gua: 154, Col: 145, Bra: 91 |
company | 141 | 0.87 | FALSE | 244 | une: 71, exp: 39, ipa: 39, rac: 26 |
region | 3 | 1.00 | FALSE | 322 | hui: 91, ori: 65, sou: 55, ver: 31 |
in_country_partner | 0 | 1.00 | FALSE | 26 | AME: 203, Spe: 173, Alm: 144, Aso: 133 |
variety | 83 | 0.92 | FALSE | 27 | Cat: 234, Typ: 207, Bou: 205, Oth: 98 |
processing_method | 69 | 0.93 | FALSE | 5 | Was: 731, Nat: 167, Sem: 52, Oth: 25 |
color | 129 | 0.88 | FALSE | 4 | Gre: 732, Blu: 81, Blu: 68, Non: 44 |
certification_body | 0 | 1.00 | FALSE | 19 | AME: 203, Spe: 173, Alm: 144, Aso: 133 |
harvest_year_num | 12 | 0.99 | FALSE | 10 | 201: 292, 201: 216, 201: 171, 201: 128 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
total_cup_points | 0 | 1 | 82.11 | 3.67 | 0 | 81.17 | 82.50 | 83.67 | 90.58 | ▁▁▁▁▇ |
number_of_bags | 0 | 1 | 156.21 | 128.55 | 0 | 18.00 | 200.00 | 275.00 | 1062.00 | ▇▇▁▁▁ |
aroma | 0 | 1 | 7.57 | 0.39 | 0 | 7.42 | 7.58 | 7.75 | 8.75 | ▁▁▁▁▇ |
flavor | 0 | 1 | 7.52 | 0.41 | 0 | 7.33 | 7.50 | 7.75 | 8.83 | ▁▁▁▁▇ |
aftertaste | 0 | 1 | 7.39 | 0.41 | 0 | 7.25 | 7.42 | 7.58 | 8.67 | ▁▁▁▁▇ |
acidity | 0 | 1 | 7.53 | 0.39 | 0 | 7.33 | 7.50 | 7.75 | 8.75 | ▁▁▁▁▇ |
body | 0 | 1 | 7.50 | 0.36 | 0 | 7.33 | 7.50 | 7.67 | 8.58 | ▁▁▁▁▇ |
balance | 0 | 1 | 7.50 | 0.42 | 0 | 7.33 | 7.50 | 7.67 | 8.75 | ▁▁▁▁▇ |
uniformity | 0 | 1 | 9.86 | 0.53 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
clean_cup | 0 | 1 | 9.85 | 0.80 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
sweetness | 0 | 1 | 9.92 | 0.52 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
cupper_points | 0 | 1 | 7.48 | 0.47 | 0 | 7.25 | 7.50 | 7.75 | 10.00 | ▁▁▁▇▁ |
moisture | 0 | 1 | 0.09 | 0.04 | 0 | 0.10 | 0.11 | 0.12 | 0.20 | ▂▁▇▁▁ |
category_one_defects | 0 | 1 | 0.38 | 1.88 | 0 | 0.00 | 0.00 | 0.00 | 31.00 | ▇▁▁▁▁ |
quakers | 1 | 1 | 0.14 | 0.72 | 0 | 0.00 | 0.00 | 0.00 | 11.00 | ▇▁▁▁▁ |
category_two_defects | 0 | 1 | 3.59 | 5.25 | 0 | 0.00 | 2.00 | 4.00 | 47.00 | ▇▁▁▁▁ |
altitude_mean_meters | 0 | 1 | 1347.24 | 448.25 | 100 | 1100.00 | 1320.00 | 1600.00 | 4287.00 | ▂▇▁▁▁ |
plot_boxplot <- function(df, col){
{{df}} %>%
ggplot(aes({{col}})) +
geom_boxplot()
}
plot_boxplot(coffee_c, total_cup_points)
# remove the sample with cup points = 0
coffee_c %>%
filter(total_cup_points == 0)
# A tibble: 1 × 30
total_cup_points species owner country_of_orig… company region
<dbl> <fct> <fct> <fct> <fct> <fct>
1 0 Arabica bismarck c… Honduras cigrah… comay…
# … with 24 more variables: number_of_bags <dbl>,
# in_country_partner <fct>, grading_date <date>, variety <fct>,
# processing_method <fct>, aroma <dbl>, flavor <dbl>,
# aftertaste <dbl>, acidity <dbl>, body <dbl>, balance <dbl>,
# uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>,
# cupper_points <dbl>, moisture <dbl>, category_one_defects <dbl>,
# quakers <dbl>, color <fct>, category_two_defects <dbl>, …
coffee_d <-
coffee_c %>%
filter(total_cup_points != 0) %>%
select(-grading_date, -expiration,
-owner, -company, -region, -species)
skim(coffee_d)
Name | coffee_d |
Number of rows | 1053 |
Number of columns | 24 |
_______________________ | |
Column type frequency: | |
factor | 7 |
numeric | 17 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
country_of_origin | 0 | 1.00 | FALSE | 35 | Mex: 230, Gua: 154, Col: 145, Bra: 91 |
in_country_partner | 0 | 1.00 | FALSE | 26 | AME: 203, Spe: 173, Alm: 144, Aso: 133 |
variety | 83 | 0.92 | FALSE | 27 | Cat: 233, Typ: 207, Bou: 205, Oth: 98 |
processing_method | 68 | 0.94 | FALSE | 5 | Was: 731, Nat: 167, Sem: 52, Oth: 25 |
color | 129 | 0.88 | FALSE | 4 | Gre: 731, Blu: 81, Blu: 68, Non: 44 |
certification_body | 0 | 1.00 | FALSE | 19 | AME: 203, Spe: 173, Alm: 144, Aso: 133 |
harvest_year_num | 12 | 0.99 | FALSE | 10 | 201: 292, 201: 216, 201: 171, 201: 128 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
total_cup_points | 0 | 1 | 82.19 | 2.66 | 59.83 | 81.17 | 82.50 | 83.67 | 90.58 | ▁▁▁▇▁ |
number_of_bags | 0 | 1 | 156.10 | 128.56 | 0.00 | 18.00 | 200.00 | 275.00 | 1062.00 | ▇▇▁▁▁ |
aroma | 0 | 1 | 7.58 | 0.31 | 5.08 | 7.42 | 7.58 | 7.75 | 8.75 | ▁▁▂▇▁ |
flavor | 0 | 1 | 7.52 | 0.33 | 6.17 | 7.33 | 7.50 | 7.75 | 8.83 | ▁▂▇▂▁ |
aftertaste | 0 | 1 | 7.40 | 0.34 | 6.17 | 7.25 | 7.42 | 7.58 | 8.67 | ▁▃▇▂▁ |
acidity | 0 | 1 | 7.53 | 0.31 | 5.25 | 7.33 | 7.50 | 7.75 | 8.75 | ▁▁▃▇▁ |
body | 0 | 1 | 7.51 | 0.28 | 6.33 | 7.33 | 7.50 | 7.67 | 8.58 | ▁▂▇▂▁ |
balance | 0 | 1 | 7.51 | 0.35 | 6.08 | 7.33 | 7.50 | 7.67 | 8.75 | ▁▂▇▃▁ |
uniformity | 0 | 1 | 9.87 | 0.44 | 6.00 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
clean_cup | 0 | 1 | 9.85 | 0.74 | 0.00 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
sweetness | 0 | 1 | 9.93 | 0.42 | 1.33 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
cupper_points | 0 | 1 | 7.48 | 0.41 | 5.17 | 7.25 | 7.50 | 7.75 | 10.00 | ▁▂▇▁▁ |
moisture | 0 | 1 | 0.09 | 0.04 | 0.00 | 0.10 | 0.11 | 0.12 | 0.20 | ▂▁▇▁▁ |
category_one_defects | 0 | 1 | 0.38 | 1.88 | 0.00 | 0.00 | 0.00 | 0.00 | 31.00 | ▇▁▁▁▁ |
quakers | 1 | 1 | 0.14 | 0.72 | 0.00 | 0.00 | 0.00 | 0.00 | 11.00 | ▇▁▁▁▁ |
category_two_defects | 0 | 1 | 3.59 | 5.25 | 0.00 | 0.00 | 2.00 | 4.00 | 47.00 | ▇▁▁▁▁ |
altitude_mean_meters | 0 | 1 | 1347.19 | 448.46 | 100.00 | 1100.00 | 1320.00 | 1600.00 | 4287.00 | ▂▇▁▁▁ |
plot_boxplot(coffee_d, total_cup_points)
plot_boxplot(coffee_d, number_of_bags)
plot_boxplot(coffee_d, moisture)
plot_boxplot(coffee_d, category_one_defects)
plot_boxplot(coffee_d, quakers)
plot_boxplot(coffee_d, category_two_defects)
# Univariate factor variables
coffee_d %>%
ggplot(aes(quakers)) +
geom_bar(stat = "count")
plot_bar <- function(col){
coffee_d %>%
ggplot(aes({{col}})) +
geom_bar(stat = "count") +
ggtitle(enquo(col))
}
plot_bar(quakers)
plot_bar(category_one_defects)
plot_bar(category_two_defects)
coffee_d %>%
ggplot(aes(total_cup_points)) +
geom_histogram()
plot_hist <- function(df, col){
{{df}} %>%
ggplot(aes({{col}})) +
geom_histogram() +
ggtitle(enquo(col))
}
plot_hist(coffee_d, total_cup_points)
plot_hist(coffee_d, altitude_mean_meters)
coffee_d %>%
ggplot(aes(x = altitude_mean_meters,
y = total_cup_points)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2),
se = F,
col = "red")
coffee_d %>%
ggplot(aes(x = moisture,
y = total_cup_points)) +
geom_point()
Dataset to be used for modelling:
coffee_cleaned <- coffee_d %>%
# mutate(log10_points = log10(total_cup_points)) %>%
mutate(variety = fct_explicit_na(variety)) %>%
mutate(processing_method = fct_explicit_na(processing_method)) %>%
mutate(color = na_if(color, "None")) %>%
mutate(color = fct_explicit_na(color)) %>%
select(total_cup_points,
country_of_origin,
in_country_partner,
variety,
processing_method,
color,
certification_body,
harvest_year_num,
moisture,
category_one_defects,
category_two_defects,
quakers,
altitude_mean_meters)
# plot variety by total cup points
coffee_cleaned %>%
ggplot(aes(fct_reorder(variety, total_cup_points),
total_cup_points)) +
geom_boxplot() +
coord_flip()
# create function
plot_ordered_boxplot <- function(df, col){
df %>%
ggplot(aes(fct_reorder({{col}}, total_cup_points),
total_cup_points)) +
geom_boxplot() +
coord_flip() +
ggtitle(enquo(col))
}
plot_ordered_boxplot(coffee_cleaned, country_of_origin)
plot_ordered_boxplot(coffee_cleaned, in_country_partner)
plot_ordered_boxplot(coffee_cleaned, variety)
plot_ordered_boxplot(coffee_cleaned, processing_method)
plot_ordered_boxplot(coffee_cleaned, color)
plot_ordered_boxplot(coffee_cleaned, certification_body)
# ggpairs-----
coffee_cleaned %>%
select(where(is.numeric)) %>%
ggpairs()
skim(coffee_cleaned)
Name | coffee_cleaned |
Number of rows | 1053 |
Number of columns | 13 |
_______________________ | |
Column type frequency: | |
factor | 7 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
country_of_origin | 0 | 1.00 | FALSE | 35 | Mex: 230, Gua: 154, Col: 145, Bra: 91 |
in_country_partner | 0 | 1.00 | FALSE | 26 | AME: 203, Spe: 173, Alm: 144, Aso: 133 |
variety | 0 | 1.00 | FALSE | 28 | Cat: 233, Typ: 207, Bou: 205, Oth: 98 |
processing_method | 0 | 1.00 | FALSE | 6 | Was: 731, Nat: 167, (Mi: 68, Sem: 52 |
color | 0 | 1.00 | FALSE | 4 | Gre: 731, (Mi: 173, Blu: 81, Blu: 68 |
certification_body | 0 | 1.00 | FALSE | 19 | AME: 203, Spe: 173, Alm: 144, Aso: 133 |
harvest_year_num | 12 | 0.99 | FALSE | 10 | 201: 292, 201: 216, 201: 171, 201: 128 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
total_cup_points | 0 | 1 | 82.19 | 2.66 | 59.83 | 81.17 | 82.50 | 83.67 | 90.58 | ▁▁▁▇▁ |
moisture | 0 | 1 | 0.09 | 0.04 | 0.00 | 0.10 | 0.11 | 0.12 | 0.20 | ▂▁▇▁▁ |
category_one_defects | 0 | 1 | 0.38 | 1.88 | 0.00 | 0.00 | 0.00 | 0.00 | 31.00 | ▇▁▁▁▁ |
category_two_defects | 0 | 1 | 3.59 | 5.25 | 0.00 | 0.00 | 2.00 | 4.00 | 47.00 | ▇▁▁▁▁ |
quakers | 1 | 1 | 0.14 | 0.72 | 0.00 | 0.00 | 0.00 | 0.00 | 11.00 | ▇▁▁▁▁ |
altitude_mean_meters | 0 | 1 | 1347.19 | 448.46 | 100.00 | 1100.00 | 1320.00 | 1600.00 | 4287.00 | ▂▇▁▁▁ |
I will only be using random forest to model, and the preprocessing recipe is quite simple as random forest is insensitive to skewed distributions. However, missing values must be imputed.
set.seed(22071301)
coffee_split <-
coffee_cleaned %>%
initial_split()
coffee_train <-
coffee_split %>%
training()
coffee_test <-
coffee_split %>%
testing()
recipe_rf <-
recipe(total_cup_points ~.,
data = coffee_train) %>%
step_impute_knn(all_predictors()) %>%
step_poly(altitude_mean_meters, degree = 2)
RF <-
rand_forest() %>%
set_args(mtry = tune(),
min_n = tune(),
trees = 1000) %>%
set_engine("ranger",
importance = "permutation") %>%
set_mode("regression")
workflow_rf <-
workflow() %>%
add_recipe(recipe_rf) %>%
add_model(RF)
set.seed(22071302)
CV <-
coffee_train %>%
vfold_cv(v=10)
tuned_rf <-
workflow_rf %>%
tune_grid(resamples = CV)
tuned_rf %>%
autoplot()
parameters_tuned_rf <-
tuned_rf %>%
select_best(metric = "rmse")
best_workflow_rf <-
workflow_rf %>%
finalize_workflow(parameters_tuned_rf)
fit_rf <-
best_workflow_rf %>%
last_fit(coffee_split)
model_performance_rf <-
fit_rf %>%
collect_metrics()
model_performance_rf # rmse = 2.07, r-sq = 0.267
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 2.06 Preprocessor1_Model1
2 rsq standard 0.269 Preprocessor1_Model1
predictions_rf <-
fit_rf %>%
collect_predictions()
predictions_plot <- predictions_rf %>%
ggplot(aes(x = total_cup_points,
y = .pred)) +
geom_point(color = "midnightblue",
alpha = 0.5) +
geom_abline(color = "red",
lty = 2) +
labs(x = "Actual Total Cupping Points",
y = "Predicted Total Cupping Points") +
tune::coord_obs_pred()
plotly::ggplotly(predictions_plot)
The predictions were quite off for those with actual scores below 80, and the model tend to over-estimate the cupping scores.
finalized_model <-
best_workflow_rf %>%
fit(coffee_cleaned)
feat_imp_model <-
RF %>%
finalize_model(select_best(tuned_rf)) %>%
set_engine("ranger",
importance = "permutation")
feature_importance <-
workflow() %>%
add_recipe(recipe_rf) %>%
add_model(feat_imp_model) %>%
fit(coffee_train) %>%
extract_fit_parsnip() %>%
vip()
feature_importance
For attribution, please cite this work as
lruolin (2022, June 25). pRactice corner: Predicting coffee cupping scores. Retrieved from https://lruolin.github.io/myBlog/posts/20220625 - Predict coffee cupping score/
BibTeX citation
@misc{lruolin2022predicting, author = {lruolin, }, title = {pRactice corner: Predicting coffee cupping scores}, url = {https://lruolin.github.io/myBlog/posts/20220625 - Predict coffee cupping score/}, year = {2022} }